Distributed NLP
نویسندگان
چکیده
In this paper we present the performance of parallel text processing with Map Reduce on a cloud platform. Scientific papers in Turkish language are processed using Zemberek NLP library. Experiments were run on a Hadoop cluster and compared with the single machine’s performance.
منابع مشابه
An Open Distributed Architecture for Reuse and Integration of Heterogeneous NLP Components
The shift from Computational Linguistics to Language Engineering is indicative of new trends in NLP. This paper reviews two NLP engineering problems: reuse and integration, while relating these concerns to the larger context of applied NLP. It presents a software architecture which is geared to support the development of a variety of large-scale NLP applications: Information Retrieval, Corpus P...
متن کاملDashboard: A Tool for Integration, Validation, and Visualization of Distributed NLP Systems on Heterogeneous Platforms
Dashboard is a tool for integration, validation, and visualization of Natural Language Processing (NLP) systems. It provides infrastructural facilities using which individual NLP modules may be evaluated and refined, and multiple NLP modules may be combined to build a large end-user NLP system. It helps system integration team to integrate and validate NLP systems. The tool provides a visualiza...
متن کاملA Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents
The recent growth of the World Wide Web at increasing rate and speed and the number of online available resources populating Internet represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented as unstructured natural language formats. In order ...
متن کاملDistributed Parse Mining
We describe the design and implementation of a system for data exploration over dependency parses and derived semantic representations in a large-scale NLP-based search system at powerset.com. Because of the distributed nature of the document repository and the processing infrastructure, and also the complex representations of the corpus data, standard text analysis tools such as grep or awk or...
متن کاملDistributed Asynchronous Online Learning for Natural Language Processing
Recent speed-ups for training large-scale models like those found in statistical NLP exploit distributed computing (either on multicore or “cloud” architectures) and rapidly converging online learning algorithms. Here we aim to combine the two. We focus on distributed, “mini-batch” learners that make frequent updates asynchronously (Nedic et al., 2001; Langford et al., 2009). We generalize exis...
متن کاملConverting Unicode Lexicon and Lexical Tools for ASCII NLP Applications
The NLP SPECIALIST Lexicon and Lexical Tools, distributed by National Library of Medicine (NLM), have been released in Unicode (UTF-8) format since 2006. Lexicon is used as corpus while Lexical Tools are used as software packages in NLP (Natural Language Processing) projects. Some NLP projects still only deal with ASCII (7-bit) characters. This paper describes how to convert UTF-8 Lexicon and i...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1802.03606 شماره
صفحات -
تاریخ انتشار 2018